
    Learning Heterogeneous Similarity Measures for Hybrid-Recommendations in Meta-Mining

    The notion of meta-mining has appeared recently and extends traditional meta-learning in two ways. First, it does not learn meta-models that support only the learning algorithm selection task, but ones that support the whole data-mining process. In addition, it abandons the so-called black-box approach to algorithm description followed in meta-learning: now, in addition to the datasets, algorithms and workflows also have descriptors, and for the latter two these descriptions are semantic, describing properties of the algorithms and workflows. With descriptors available both for datasets and data-mining workflows, the traditional modelling techniques followed in meta-learning, typically based on classification and regression algorithms, are no longer appropriate. Instead we are faced with a problem whose nature is much more similar to the problems that appear in recommendation systems. The most important meta-mining requirements are that suggestions should rely only on dataset and workflow descriptors, and that the cold-start problem, i.e. providing workflow suggestions for new datasets, must be addressed. In this paper we take a different view of the meta-mining modelling problem and treat it as a recommender problem. To account for the meta-mining specificities we derive a novel metric-based learning recommender approach. Our method learns two homogeneous metrics, one in the dataset and one in the workflow space, and a heterogeneous one in the dataset-workflow space. All learned metrics reflect similarities established from the dataset-workflow preference matrix. We demonstrate our method on meta-mining over biological (microarray) problems. The application of our method is not limited to the meta-mining problem; its formulation is general enough that it can be applied to problems with similar requirements.
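A minimal, hypothetical sketch of the cold-start setting this abstract describes: given descriptors for datasets and a dataset-workflow preference matrix, workflows for a new dataset can be ranked via the preferences of its nearest described datasets. The plain Euclidean metric and the toy preference matrix here are illustrative assumptions; the paper's contribution is to learn the metrics instead.

```python
# Sketch of cold-start workflow recommendation from descriptors.
# Assumes: rows of `preferences` correspond to datasets, columns to
# workflows; a fixed Euclidean metric stands in for the learned ones.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def recommend(new_descr, dataset_descrs, preferences, k=2):
    """Rank workflow indices for a new (unseen) dataset descriptor."""
    # find the k datasets whose descriptors are closest to the new one
    nearest = sorted(range(len(dataset_descrs)),
                     key=lambda i: euclidean(new_descr, dataset_descrs[i]))[:k]
    n_workflows = len(preferences[0])
    # score each workflow by its mean preference among the neighbours
    scores = [sum(preferences[i][w] for i in nearest) / k
              for w in range(n_workflows)]
    return sorted(range(n_workflows), key=lambda w: -scores[w])

descrs = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]
prefs = [[5, 1, 3], [4, 2, 3], [1, 5, 2]]   # rows: datasets, cols: workflows
ranking = recommend([0.15, 0.85], descrs, prefs, k=2)
```

The new descriptor is closest to the first two datasets, so workflows are ranked by their average preference there.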

    Stability of feature selection algorithms: a study on high-dimensional spaces

    With the proliferation of extremely high-dimensional data, feature selection algorithms have become indispensable components of the learning process. Strangely, despite extensive work on the stability of learning algorithms, the stability of feature selection algorithms has been relatively neglected. This study is an attempt to fill that gap by quantifying the sensitivity of feature selection algorithms to variations in the training set. We assess the stability of feature selection algorithms based on the stability of the feature preferences they express, in the form of feature weights or scores, ranks, or a selected feature subset. We examine a number of measures to quantify the stability of feature preferences and propose an empirical way to estimate them. We perform a series of experiments with several feature selection algorithms on a set of proteomics datasets. The experiments allow us to explore the merits of each stability measure and create stability profiles of the feature selection algorithms. Finally, we show how stability profiles can support the choice of a feature selection algorithm.
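One common empirical estimate of the kind described here, sketched under the assumption that preferences are expressed as a selected feature subset: resample the training set, rerun the selector, and average the pairwise Jaccard similarity of the selected subsets. The resampling scheme and similarity measure are illustrative choices, not necessarily the exact ones studied in the paper.

```python
import random

def jaccard(a, b):
    """Jaccard similarity between two feature subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def stability(select, data, labels, n_rounds=10, seed=0):
    """Average pairwise Jaccard similarity of the subsets that `select`
    returns on bootstrap resamples of (data, labels)."""
    rng = random.Random(seed)
    n = len(data)
    subsets = []
    for _ in range(n_rounds):
        idx = [rng.randrange(n) for _ in range(n)]       # bootstrap sample
        subsets.append(select([data[i] for i in idx],
                              [labels[i] for i in idx]))
    sims = [jaccard(subsets[i], subsets[j])
            for i in range(n_rounds) for j in range(i + 1, n_rounds)]
    return sum(sims) / len(sims)
```

A selector that always returns the same subset scores a perfect 1.0; an unstable one scores close to the chance-level overlap of random subsets.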

    Learning to extract relations for protein annotation

    Motivation: Protein annotation is the task of describing a protein X in terms of a topic Y, usually constructed using information from the biomedical literature. Until now, most literature-based protein annotation has been done manually by human annotators. However, as the number of biomedical papers grows ever more rapidly, manual annotation becomes more difficult, and there is an increasing need to automate the process. Recently, information extraction (IE) has been used to address this problem. Typically, IE requires pre-defined relations and hand-crafted IE rules or annotated corpora, and these requirements are difficult to satisfy in real-world scenarios such as the biomedical domain. In this article, we describe an IE system that requires only sentences labelled by domain experts as relevant or not to a given topic. Results: We applied our system to meet the annotation needs of a well-known protein family database; the results show that our IE system can annotate proteins with a set of extracted relations by learning relations and IE rules for disease, function and structure from only relevant and irrelevant sentences. Contact: [email protected]

    An Overview of Strategies for Neurosymbolic Integration

    At the crossroads of symbolic and neural processing, researchers have been actively investigating the synergies that might be obtained from combining the strengths of these two paradigms. Neurosymbolic integration comes in two flavors: unified and hybrid. Unified approaches strive to attain full symbol-processing functionalities using neural techniques alone, while hybrid approaches blend symbolic reasoning and representational models with neural networks. This paper attempts to clarify and compare the objectives, mechanisms, variants and underlying assumptions of these major integration approaches.

    1 Introduction. Throughout its brief history, the field of artificial intelligence (AI) has been the arena of jousts between two frères ennemis, symbolicism and connectionism. No sooner had connectionism recovered from [Minsky and Papert, 1969]'s devastating blows than Fodor and Pylyshyn charged to the fore in the name of symbolic AI. They argued that connectionism cannot be a valid th…

    Distances and (indefinite) kernels for sets of objects

    For various classification problems involving complex data, it is most natural to represent each training example as a set of vectors. While several distance measures for sets have been proposed, only a few kernels over these structures exist, since it is difficult in general to design a positive semidefinite (PSD) similarity function. The main disadvantage of most existing set kernels is that they are based on averaging, which might be inappropriate for problems where only specific elements of the two sets should determine the overall similarity. In this paper we propose a class of kernels for sets of vectors that directly exploits set distance measures, thereby incorporating various semantics into set kernels and lending the power of regularization to learning in structural domains where natural distance functions exist. These kernels belong to two groups: (i) kernels in the proximity space induced by set distances and (ii) set distance substitution kernels (non-PSD in general). We report experimental results which show that our kernels compare favorably with kernels based on averaging and achieve results similar to other state-of-the-art methods. At the same time, our kernels systematically improve over the naive way of exploiting distances.
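The proximity-space idea in group (i) can be sketched concretely: pick a set distance (the symmetric Hausdorff distance is used here purely as an example), represent each set by its vector of distances to a fixed list of prototype sets, and then apply any standard PSD kernel to those vectors. The prototype list and the 1-D elements are illustrative assumptions.

```python
def hausdorff(A, B):
    """Symmetric Hausdorff distance between two sets of 1-D points:
    the largest distance from any point to its nearest neighbour
    in the other set."""
    directed = lambda S, T: max(min(abs(s - t) for t in T) for s in S)
    return max(directed(A, B), directed(B, A))

def proximity_embedding(x, prototypes, set_dist):
    """Represent a set by its distances to fixed prototype sets.
    Any PSD kernel (linear, RBF, ...) on these vectors is then valid,
    even though the set distance itself need not be."""
    return [set_dist(x, p) for p in prototypes]

embedding = proximity_embedding([0, 1], [[0, 2], [5, 6]], hausdorff)
```

The non-PSD difficulty is thus sidestepped: the distance only defines the embedding, and positive semidefiniteness comes from the kernel applied on top.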

    Distance-based learning over extended relational algebra structures

    In (Kalousis et al., 2005) we presented a novel unifying framework for relational distance-based learning in which learning examples are stored in a relational database. The approach is based on concepts from relational algebra and exploits the notion of foreign key associations to define a new attribute of type set. We defined several relational distances whose building blocks are distances between tuples of relations and distances between sets. In this paper we extend this relational algebra representation language so that it allows for the modelling of lists of complex objects (relational instances in our case). We define a new type of foreign key association which, in addition to attributes of type set, gives rise to a new attribute of type list. We extend the well-known alignment-based edit distance measure on lists to fit within our framework. Our extended distance-based learning algorithm is tested on a protein fingerprint classification dataset, for which promising results are reported.
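The alignment-based edit distance mentioned above generalizes the classic string edit distance: substitutions cost the distance between the two elements (which may themselves be complex relational instances), and insertions/deletions cost a fixed gap penalty. A minimal dynamic-programming sketch, with the gap cost as an assumed parameter:

```python
def edit_distance(xs, ys, elem_dist, gap=1.0):
    """Alignment-based edit distance between two lists of complex
    elements; substitutions cost elem_dist(x, y), insert/delete cost `gap`."""
    n, m = len(xs), len(ys)
    # D[i][j] = cost of aligning the first i elements of xs with
    # the first j elements of ys
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + elem_dist(xs[i - 1], ys[j - 1]),
                          D[i - 1][j] + gap,        # delete xs[i-1]
                          D[i][j - 1] + gap)        # insert ys[j-1]
    return D[n][m]
```

With a 0-1 element distance this reduces to the Levenshtein distance; in the relational setting, `elem_dist` would itself be a (recursive) distance between relational instances.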

    Representational issues in meta-learning

    To address the problem of algorithm selection for the classification task, we equip a relational case base with new similarity measures that are able to cope with multirelational representations. The proposed approach builds on notions from clustering and is closely related to ideas developed in similarity-based relational learning. The results provide evidence that the relational representation, coupled with the appropriate similarity measure, can improve performance. The ideas presented are pertinent not only to meta-learning representational issues, but to all domains with similar representation requirements.

    Matching Based Kernels for Labeled Graphs

    For various classification problems it is most natural to represent training examples as labeled graphs. As a result, several kernel functions over these complex structures have been proposed in the literature so far. Most of them exploit the Cross Product Kernel between the two sets resulting from the decompositions of the corresponding graphs into subgraphs of a specific type. The similarities between the substructures are often computed using the 0-1 Kronecker Delta Kernel. This approach has two main limitations: (i) in general most of the subgraphs will be poorly correlated with the actual target variable, adversely affecting the generalization of a classifier, and (ii) as no graded similarities on subparts are computed, the expressivity of the resulting kernels is reduced. To tackle the above problems we propose here a class of graph kernels based on set distance measures whose computation is based on specific pairs of points from the corresponding graph decompositions. The actual matching of elements from the two sets depends on a graded similarity (the standard Euclidean metric in our case) between the elements of the two sets. To make our similarity measure positive semidefinite we exploit the notion of the proximity space induced by a given set distance measure. To practically demonstrate the effectiveness of our approach we report promising experimental results for the task of activity prediction of drug molecules.
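One way to make such a matching concrete is the sum-of-minimum-distances set distance, in which each element is matched to its closest counterpart in the other decomposition rather than averaged against all of them as in the Cross Product Kernel. This particular distance is an illustrative choice, not necessarily the exact one used in the paper:

```python
def sum_of_min_dists(A, B, d):
    """A matching-based set distance: every element of each set is
    matched to its nearest element of the other set, and the matched
    distances are averaged over the total number of elements."""
    total = (sum(min(d(a, b) for b in B) for a in A) +
             sum(min(d(b, a) for a in A) for b in B))
    return total / (len(A) + len(B))
```

Because only the best-matching pairs contribute, a few highly similar subgraphs can dominate the overall similarity; the resulting (generally non-PSD) distance is then turned into a valid kernel via the proximity-space construction described in the abstract.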

    Learning to Combine Distances for Complex Representations

    The k-Nearest Neighbors algorithm can be easily adapted to classify complex objects (e.g. sets, graphs) as long as a proper dissimilarity function is given over the input space. Both the representation of the learning instances and the dissimilarity employed on that representation should be determined on the basis of domain knowledge. However, even in the presence of domain knowledge, it can be far from obvious which complex representation should be used or which dissimilarity should be applied on the chosen representation. In this paper we present a framework that allows combining different complex representations of a given learning problem and/or different dissimilarities defined on these representations. We build on ideas previously developed for metric learning on vectorial data. We demonstrate the utility of our method in domains where the learning instances are represented as sets of vectors, by learning how to combine different set distance measures.
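The combination idea can be sketched as a convex combination of candidate dissimilarities whose weights are chosen to maximize leave-one-out 1-NN accuracy. The exhaustive grid search below is a crude stand-in for the learning procedure in the paper, and the two-distance toy problem is an assumption for illustration:

```python
def combine(dists, weights):
    """Convex combination of several dissimilarity functions."""
    return lambda x, y: sum(w * d(x, y) for d, w in zip(dists, weights))

def loo_1nn_accuracy(dist, items, labels):
    """Leave-one-out 1-nearest-neighbour accuracy under `dist`."""
    correct = 0
    for i, x in enumerate(items):
        j = min((j for j in range(len(items)) if j != i),
                key=lambda j: dist(x, items[j]))
        correct += labels[j] == labels[i]
    return correct / len(items)

def best_weights(dists, items, labels, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid search over convex combinations of two dissimilarities,
    keeping the one with highest leave-one-out 1-NN accuracy."""
    return max(((w, 1 - w) for w in grid),
               key=lambda ws: loo_1nn_accuracy(combine(dists, ws),
                                               items, labels))
```

With one informative and one uninformative dissimilarity, the search concentrates weight on the informative one, which is exactly the behaviour the learned combination is meant to produce.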